KernelSHAP
Amortization
Application to case study
Computing Shapley values exactly is intractable except in trivial cases.
How do we compute the conditional expectation
\[v_{x}(S) = \mathbb{E}_{p(x'_{S^C} \mid x_{S})}[f(x_{S}, x'_{S^C})]\]
in practice?
Replace \(x'_{S^C}\) with a fixed reference value (zeros, or the mean \(\bar{x}_{S^C}\)):
\[v_{x}(S) \approx f(x_{S}, \bar{x}_{S^C})\]
One model evaluation per \(S\), but a very rough approximation: it ignores correlations between features.
Assume \(x_{S} \perp x_{S^C}\):
\[v_{x}(S) = \mathbb{E}_{p(x'_{S^C})}[f(x_{S}, x'_{S^C})] \approx \frac{1}{N}\sum_{n=1}^{N} f(x_{S}, x'_{n,S^C})\]
Sample (or subsample) the \(x'_{n,S^C}\) from the marginal distribution, e.g. rows of a background dataset.
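The marginal-sampling estimate above can be sketched as follows; the model `f`, the instance `x`, and the background data `X_bg` are illustrative placeholders.

```python
import numpy as np

def value_function(f, x, S, X_bg, n_samples=100, rng=None):
    """Monte Carlo estimate of v_x(S) under the independence assumption:
    draw x'_{S^C} from the marginal, approximated by background rows."""
    rng = np.random.default_rng(rng)
    idx = rng.integers(0, X_bg.shape[0], size=n_samples)
    samples = X_bg[idx].copy()      # x'_{S^C} sampled from the marginal
    cols = sorted(S)
    samples[:, cols] = x[cols]      # fix the features in S to x_S
    return f(samples).mean()

# Toy usage: additive model with an all-zeros background,
# so v_x(S) equals the sum of the fixed features.
f = lambda X: X.sum(axis=1)
x = np.array([1.0, 2.0, 3.0])
X_bg = np.zeros((500, 3))
print(value_function(f, x, S={0, 2}, X_bg=X_bg))  # → 4.0
```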
It can be shown that the Shapley values solve a weighted linear regression:
\[v_{x}(S) \approx \varphi_{x}(f, \emptyset) + \sum_{d=1}^{D} \mathbb{1}_{d \in S}\varphi_{x}(f, d)\]
The indicators \(\mathbb{1}_{d \in S}\) are known; the attributions \(\varphi_{x}(f, d)\) are the unknown regression coefficients. Compute \(v_{x}(S)\) on sampled subsets to obtain the regression targets.
Each row: subset \(S\) of features.
Weight for subset \(S\): \[\frac{D - 1}{\binom{D}{|S|}\,|S|\,(D - |S|)}\]
(The weight diverges for \(S = \emptyset\) and \(S = \{1, \dots, D\}\); these subsets are enforced as constraints instead.)
This weighted regression recovers the exact Shapley values \(\varphi_{x}(f, d)\).
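For small \(D\), the weighted regression can be solved exactly by enumerating all proper nonempty subsets and eliminating one coefficient to enforce the efficiency constraint. A minimal sketch; the value function `v` is assumed to be supplied by the caller:

```python
import numpy as np
from itertools import combinations
from math import comb

def kernel_shap(v, D):
    """Exact KernelSHAP for small D: enumerate all proper nonempty subsets S,
    weight them by the Shapley kernel, and solve the constrained weighted
    least squares. `v` maps a tuple of feature indices to v_x(S)."""
    subsets = [S for r in range(1, D) for S in combinations(range(D), r)]
    Z = np.array([[1.0 if d in S else 0.0 for d in range(D)] for S in subsets])
    w = np.array([(D - 1) / (comb(D, len(S)) * len(S) * (D - len(S)))
                  for S in subsets])
    v0, vfull = v(()), v(tuple(range(D)))
    y = np.array([v(S) for S in subsets]) - v0
    # Eliminate the last coefficient via the efficiency constraint
    # sum_d phi_d = v(full) - v(empty), then solve the weighted normal equations.
    Z_red = Z[:, :-1] - Z[:, [-1]]
    y_red = y - Z[:, -1] * (vfull - v0)
    W = np.diag(w)
    phi = np.linalg.solve(Z_red.T @ W @ Z_red, Z_red.T @ W @ y_red)
    return np.append(phi, (vfull - v0) - phi.sum())

# Toy check: for an additive game the Shapley values are the per-feature terms.
beta = np.array([2.0, 3.0, -1.0])
phi = kernel_shap(lambda S: beta[list(S)].sum(), D=3)
print(phi)  # → [ 2.  3. -1.]
```

In practice, subsets are sampled rather than enumerated, which is what makes KernelSHAP tractable for larger \(D\).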
KernelSHAP runs per-instance: thousands of model evaluations per data point.
For millions of samples, this cost dominates.
Idea. Train an explainer model \(g_\theta(x)\) to predict the attributions \(\phi(x)\) directly.
Speed. One forward pass after training.
Generalization. Exploits similarities between data points.
Problem. Training requires labels \(\phi(x)\), which are exactly what we want to avoid computing.
Stochastic amortization. Use cheap, noisy estimates \(\tilde{\phi}(x)\) as regression targets:
\[\tilde{\mathcal{L}}_{\mathrm{reg}}(\theta) = \mathbb{E}\big[\|g_\theta(x) - \tilde{\phi}(x)\|^{2}\big]\]
Use high-variance estimates (KernelSHAP with 1-5 samples) as targets.
If \(\mathbb{E}[\tilde{\phi}(x) \mid x] = \phi(x)\), the trained model converges to the correct attributions: the high variance of the targets slows convergence but introduces no bias.
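A minimal sketch of stochastic amortization with a linear explainer on synthetic data; the ground-truth attributions and the Gaussian noise are stand-ins for expensive exact Shapley values and cheap, high-variance KernelSHAP estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 5, 4000
X = rng.normal(size=(n, D))

# Synthetic ground-truth attributions (stand-in for exact Shapley values)
# and high-variance but unbiased targets (stand-in for cheap KernelSHAP runs).
beta = np.array([2.0, -1.0, 0.5, 0.0, 3.0])
phi_true = X * beta
phi_noisy = phi_true + rng.normal(scale=5.0, size=(n, D))

# Linear explainer g_theta(x) = x @ Theta, trained by gradient descent on
# the regression loss E||g_theta(x) - phi_tilde(x)||^2.
Theta = np.zeros((D, D))
lr = 0.1
for step in range(2000):
    grad = (2.0 / n) * X.T @ (X @ Theta - phi_noisy)
    Theta -= lr * grad

# Despite the noisy targets, the learned explainer recovers the true
# attribution map: the diagonal of Theta approximates beta.
print(np.round(np.diag(Theta), 1))
```

The same recipe carries over to neural explainers trained with minibatch SGD; only the model class and the optimizer change.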
Empirically, amortization beats per-example computation for roughly \(n > 1000\) data points.
(add a figure)
Analyze enhancers: DNA sequences that regulate gene expression in Drosophila embryos.
Target. Binary enhancer status (1 = enhancer, 0 = non-enhancer).
Features. Transcription factor and chromatin mark levels.
Model. Gradient boosting via the mboost package, tuned as in the previous case studies.
Tuning. 3-fold CV selects the learning rate (nu) and the number of boosting iterations (mstop) by AUC.
Before interpreting, verify model quality.
AUC. High values indicate that the model has learned genuine biological signal.
Calibration. Histogram of predicted probabilities per class.
sv_importance identifies influential transcription factors.
Bicoid (bcd2). High values increase SHAP → major activator.
Twist (twi2). Strong positive impact, consistent with known role.
sv_force explains individual sequences.
Each prediction: balance of “pushes” from different genes.
A sequence may be classified as an enhancer primarily due to high bcd2, even if twi2 is low.